Latent neural processes (LNPs) [1] use a training objective inspired by the ELBO. The derivation proceeds in two steps:
- First, we derive an ELBO that bounds the conditional marginal likelihood $\log p_\theta(\mathbf{y}_\mathcal{T} \mid \mathbf{x}_\mathcal{T}, \mathcal{C})$ instead of the usual $\log p_\theta(\mathbf{y}_\mathcal{D} \mid \mathbf{x}_\mathcal{D})$. This conditional marginal likelihood is more representative of the desired task in NPs, i.e. recovering the behaviour of the function in unseen (target, $\mathcal{T}$) regions given known (context, $\mathcal{C}$) regions.
- Second, we approximate the previous ELBO, which contains an intractable term.
Suppose we have a dataset $\mathcal{D} = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{n}$ whose outputs $\mathbf{y}_\mathcal{D}$ have been generated from a latent variable $\mathbf{z}$ by a model $p_\theta(\mathbf{y}_\mathcal{D} \mid \mathbf{x}_\mathcal{D}, \mathbf{z})\, p(\mathbf{z})$. We could estimate the parameters $\theta$ by maximum likelihood, by maximizing the marginal likelihood of $\mathbf{y}_\mathcal{D}$. The log marginal likelihood can be decomposed as

$$\log p_\theta(\mathbf{y}_\mathcal{D} \mid \mathbf{x}_\mathcal{D}) = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathcal{D})}\left[ \log \frac{p_\theta(\mathbf{y}_\mathcal{D}, \mathbf{z} \mid \mathbf{x}_\mathcal{D})}{q_\phi(\mathbf{z} \mid \mathcal{D})} \right] + \operatorname{KL}\!\left( q_\phi(\mathbf{z} \mid \mathcal{D}) \,\big\|\, p_\theta(\mathbf{z} \mid \mathcal{D}) \right),$$
which, since the KL divergence is nonnegative, results in the usual ELBO bound

$$\log p_\theta(\mathbf{y}_\mathcal{D} \mid \mathbf{x}_\mathcal{D}) \geq \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathcal{D})}\left[ \log \frac{p_\theta(\mathbf{y}_\mathcal{D}, \mathbf{z} \mid \mathbf{x}_\mathcal{D})}{q_\phi(\mathbf{z} \mid \mathcal{D})} \right].$$
Note that this is equivalent to the ELBO bound of a VAE for a single datapoint $\mathbf{y}$ originating from a latent variable $\mathbf{z}$, with the only difference that here we condition everything on the inputs $\mathbf{x}$ too (by design).
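The decomposition above can be checked numerically in a conjugate Gaussian toy model where every quantity is available in closed form. The numbers and the model below ($z \sim \mathcal{N}(0,1)$, $y \mid z \sim \mathcal{N}(z, \sigma^2)$, ignoring inputs $\mathbf{x}$ for simplicity) are illustrative assumptions, not part of the NP setup:

```python
# Toy check of  log p(y) = ELBO(q) + KL(q || p(z|y))  for a conjugate
# Gaussian model:  z ~ N(0, 1),  y | z ~ N(z, sigma^2).
import math

sigma2 = 0.5          # observation noise variance (assumed value)
y = 1.3               # a single observed datapoint
m, s2 = 0.4, 0.7      # an arbitrary variational q(z) = N(m, s2)

# Exact log marginal likelihood: y ~ N(0, 1 + sigma^2)
log_py = -0.5 * math.log(2 * math.pi * (1 + sigma2)) - y**2 / (2 * (1 + sigma2))

# ELBO(q) = E_q[log p(y|z)] - KL(q || p(z))
recon = -0.5 * math.log(2 * math.pi * sigma2) - ((y - m) ** 2 + s2) / (2 * sigma2)
kl_prior = 0.5 * (m**2 + s2 - 1 - math.log(s2))
elbo = recon - kl_prior

# KL(q || p(z|y)) with the exact posterior  z | y ~ N(mu_p, s2_p)
mu_p = y / (1 + sigma2)
s2_p = sigma2 / (1 + sigma2)
kl_post = 0.5 * (math.log(s2_p / s2) + (s2 + (m - mu_p) ** 2) / s2_p - 1)

assert math.isclose(log_py, elbo + kl_post)  # decomposition holds exactly
```

Since the gap `kl_post` is nonnegative, the ELBO is indeed a lower bound for any choice of `m` and `s2`, and it is tight exactly when $q$ equals the true posterior.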
Now, suppose that you partition the dataset into a context set $\mathcal{C} = (\mathbf{x}_\mathcal{C}, \mathbf{y}_\mathcal{C})$ and a target set $\mathcal{T} = (\mathbf{x}_\mathcal{T}, \mathbf{y}_\mathcal{T})$. The goal is to infer $\mathbf{y}_\mathcal{T}$ given information from $\mathcal{C}$. We could try to obtain appropriate parameters by maximizing the same marginal likelihood as before, $\log p_\theta(\mathbf{y}_\mathcal{D} \mid \mathbf{x}_\mathcal{D})$. However, this is an indirect objective, since it represents maximizing the likelihood of the entire dataset. What we really want to maximize is the conditional marginal likelihood $\log p_\theta(\mathbf{y}_\mathcal{T} \mid \mathbf{x}_\mathcal{T}, \mathcal{C})$. Following the same VAE analogy as before, this would be equivalent to reconstructing part of a datapoint based on the rest of the datapoint (for example, reconstructing the left side of an image based on the right side).
We can obtain an ELBO bound for this conditional marginal likelihood as follows. The LHS of the usual ELBO can be rewritten as

$$\log p_\theta(\mathbf{y}_\mathcal{D} \mid \mathbf{x}_\mathcal{D}) = \log p_\theta(\mathbf{y}_\mathcal{T} \mid \mathbf{x}_\mathcal{D}, \mathbf{y}_\mathcal{C}) + \log p_\theta(\mathbf{y}_\mathcal{C} \mid \mathbf{x}_\mathcal{D}).$$
Similarly, the RHS can be rewritten as

$$\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathcal{D})}\left[ \log \frac{p_\theta(\mathbf{y}_\mathcal{T} \mid \mathbf{x}_\mathcal{D}, \mathbf{y}_\mathcal{C}, \mathbf{z})\, p_\theta(\mathbf{y}_\mathcal{C} \mid \mathbf{x}_\mathcal{D}, \mathbf{z})\, p(\mathbf{z})}{q_\phi(\mathbf{z} \mid \mathcal{D})} \right].$$
We can obtain the desired ELBO by subtracting $\log p_\theta(\mathbf{y}_\mathcal{C} \mid \mathbf{x}_\mathcal{D})$ from both sides, pushing it inside the expectation in the RHS. In this way, after subtracting this term the LHS becomes the desired quantity $\log p_\theta(\mathbf{y}_\mathcal{T} \mid \mathbf{x}_\mathcal{D}, \mathbf{y}_\mathcal{C})$. The RHS becomes

$$\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathcal{D})}\left[ \log \frac{p_\theta(\mathbf{y}_\mathcal{T} \mid \mathbf{x}_\mathcal{D}, \mathbf{y}_\mathcal{C}, \mathbf{z})}{q_\phi(\mathbf{z} \mid \mathcal{D})} \cdot \frac{p_\theta(\mathbf{y}_\mathcal{C} \mid \mathbf{x}_\mathcal{D}, \mathbf{z})\, p(\mathbf{z})}{p_\theta(\mathbf{y}_\mathcal{C} \mid \mathbf{x}_\mathcal{D})} \right],$$
where the last factor simplifies into a posterior over the latent variable by Bayes' rule:

$$\frac{p_\theta(\mathbf{y}_\mathcal{C} \mid \mathbf{x}_\mathcal{D}, \mathbf{z})\, p(\mathbf{z})}{p_\theta(\mathbf{y}_\mathcal{C} \mid \mathbf{x}_\mathcal{D})} = p_\theta(\mathbf{z} \mid \mathbf{x}_\mathcal{D}, \mathbf{y}_\mathcal{C}).$$
Thus, we arrive at

$$\log p_\theta(\mathbf{y}_\mathcal{T} \mid \mathbf{x}_\mathcal{D}, \mathbf{y}_\mathcal{C}) \geq \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathcal{D})}\left[ \log p_\theta(\mathbf{y}_\mathcal{T} \mid \mathbf{x}_\mathcal{D}, \mathbf{y}_\mathcal{C}, \mathbf{z}) \right] - \operatorname{KL}\!\left( q_\phi(\mathbf{z} \mid \mathcal{D}) \,\big\|\, p_\theta(\mathbf{z} \mid \mathbf{x}_\mathcal{D}, \mathbf{y}_\mathcal{C}) \right),$$
which becomes the desired bound after making the (reasonable) assumption that each prediction can only be informed by its corresponding inputs, for example so that $p_\theta(\mathbf{y}_\mathcal{T} \mid \mathbf{x}_\mathcal{D}, \mathbf{y}_\mathcal{C}, \mathbf{z})$ becomes $p_\theta(\mathbf{y}_\mathcal{T} \mid \mathbf{x}_\mathcal{T}, \mathbf{z})$:

$$\log p_\theta(\mathbf{y}_\mathcal{T} \mid \mathbf{x}_\mathcal{T}, \mathcal{C}) \geq \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathcal{D})}\left[ \log p_\theta(\mathbf{y}_\mathcal{T} \mid \mathbf{x}_\mathcal{T}, \mathbf{z}) \right] - \operatorname{KL}\!\left( q_\phi(\mathbf{z} \mid \mathcal{D}) \,\big\|\, p_\theta(\mathbf{z} \mid \mathcal{C}) \right).$$
We would be done except for one problem: this expression is unfortunately intractable, because the true posterior $p_\theta(\mathbf{z} \mid \mathcal{C})$ is intractable.
Note that, since we have subtracted the same term from the marginal likelihood and the ELBO, the gap between the bound and the conditional log marginal likelihood remains the same as in the original decomposition, i.e. it is still $\operatorname{KL}\!\left( q_\phi(\mathbf{z} \mid \mathcal{D}) \,\big\|\, p_\theta(\mathbf{z} \mid \mathcal{D}) \right)$.
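The conditional decomposition above can also be verified numerically in a conjugate Gaussian toy model with one context and one target observation. The model and the numbers are illustrative assumptions: $z \sim \mathcal{N}(0,1)$ and $y_i \mid z \sim \mathcal{N}(z, \sigma^2)$, so both posteriors, the conditional marginal, and all KL terms are analytic:

```python
# Numeric check (toy conjugate Gaussian model) of the exact identity
#   log p(y_T | y_C) = E_q[log p(y_T|z)] - KL(q || p(z|C)) + KL(q || p(z|D))
# with z ~ N(0,1) and y_i | z ~ N(z, sigma^2).
import math

def kl_gauss(m0, v0, m1, v1):
    """KL( N(m0, v0) || N(m1, v1) ) for 1-D Gaussians."""
    return 0.5 * (math.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1)

sigma2 = 0.5
y_c, y_t = 0.8, 1.3      # context and target observations
m, s2 = 0.2, 0.6         # an arbitrary variational q(z) = N(m, s2)

# Exact posteriors given the context only and given the full dataset
mu_c, v_c = y_c / (1 + sigma2), sigma2 / (1 + sigma2)   # p(z | C)
v_d = sigma2 / (sigma2 + 2)                             # p(z | D)
mu_d = (y_c + y_t) / (sigma2 + 2)

# Exact conditional marginal:  y_T | y_C ~ N(mu_c, v_c + sigma2)
log_cond = -0.5 * math.log(2 * math.pi * (v_c + sigma2)) \
           - (y_t - mu_c) ** 2 / (2 * (v_c + sigma2))

recon = -0.5 * math.log(2 * math.pi * sigma2) - ((y_t - m) ** 2 + s2) / (2 * sigma2)
elbo_cond = recon - kl_gauss(m, s2, mu_c, v_c)   # the conditional ELBO
gap = kl_gauss(m, s2, mu_d, v_d)                 # its nonnegative gap, KL(q || p(z|D))

assert math.isclose(log_cond, elbo_cond + gap)
```

The gap is exactly $\operatorname{KL}(q \,\|\, p(z \mid \mathcal{D}))$, matching the observation that subtracting the same term from both sides leaves the overall KL divergence unchanged.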
The previous ELBO is a bound of the desired marginal likelihood $\log p_\theta(\mathbf{y}_\mathcal{T} \mid \mathbf{x}_\mathcal{T}, \mathcal{C})$, but it is intractable because $p_\theta(\mathbf{z} \mid \mathcal{C})$ is. LNPs circumvent this issue by approximating this term with the variational distribution conditioned on the context set, $q_\phi(\mathbf{z} \mid \mathcal{C})$, yielding the training objective

$$\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathcal{D})}\left[ \log p_\theta(\mathbf{y}_\mathcal{T} \mid \mathbf{x}_\mathcal{T}, \mathbf{z}) \right] - \operatorname{KL}\!\left( q_\phi(\mathbf{z} \mid \mathcal{D}) \,\big\|\, q_\phi(\mathbf{z} \mid \mathcal{C}) \right).$$
The right term (the KL or regularization term) can now be interpreted in the following way: the variational distribution over the latent variable should be the same when the model has access to full information about the function ($q_\phi(\mathbf{z} \mid \mathcal{D})$) and when the model has only partial information about the function ($q_\phi(\mathbf{z} \mid \mathcal{C})$). This seems a reasonable objective for NPs, which try to recover the whole $\mathcal{D}$ based only on $\mathcal{C}$.
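As a concrete illustration, the approximate objective can be sketched in plain numpy. Everything here is a hypothetical stand-in, not the architecture from the paper: a toy mean-pooling encoder maps a set of $(x, y)$ pairs to a diagonal Gaussian over $\mathbf{z}$, a toy decoder predicts target outputs from a sample of $\mathbf{z}$, and the loss combines a Monte Carlo reconstruction term with the analytic KL between the two variational distributions:

```python
# Minimal numpy sketch (illustrative, not the original implementation) of the
# approximate LNP objective:
#   -( E_{q(z|D)}[log p(y_T | x_T, z)] - KL( q(z|D) || q(z|C) ) )
import numpy as np

rng = np.random.default_rng(0)
dim_z = 2

def encode(x, y):
    """Hypothetical encoder: mean-pool per-point features into (mu, var)."""
    feats = np.stack([x, y, x * y], axis=-1).mean(axis=0)  # permutation-invariant
    mu = np.tanh(feats.sum()) * np.ones(dim_z)
    var = np.exp(-np.abs(feats).sum()) * np.ones(dim_z) + 1e-2
    return mu, var

def decode(x_t, z):
    """Hypothetical decoder: predictive mean for each target input."""
    return z[0] * x_t + z[1]

def lnp_loss(x_c, y_c, x_t, y_t, n_samples=8, noise_var=0.1):
    mu_d, var_d = encode(np.concatenate([x_c, x_t]), np.concatenate([y_c, y_t]))
    mu_c, var_c = encode(x_c, y_c)
    # Monte Carlo estimate of E_{q(z|D)}[ log p(y_T | x_T, z) ]
    recon = 0.0
    for _ in range(n_samples):
        z = mu_d + np.sqrt(var_d) * rng.standard_normal(dim_z)
        mean_t = decode(x_t, z)
        recon += np.sum(-0.5 * np.log(2 * np.pi * noise_var)
                        - (y_t - mean_t) ** 2 / (2 * noise_var))
    recon /= n_samples
    # Analytic KL between the diagonal Gaussians q(z|D) and q(z|C)
    kl = 0.5 * np.sum(np.log(var_c / var_d)
                      + (var_d + (mu_d - mu_c) ** 2) / var_c - 1)
    return -(recon - kl)   # negative objective, to be minimized

x_c, y_c = rng.standard_normal(5), rng.standard_normal(5)
x_t, y_t = rng.standard_normal(7), rng.standard_normal(7)
loss = lnp_loss(x_c, y_c, x_t, y_t)
```

In practice the encoder and decoder are neural networks trained jointly, and sampling uses the reparameterization trick so gradients flow through `z`, as in the reparameterized Monte Carlo line above.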
Note, however, that this ELBO-like objective is no longer an analytical lower bound of the conditional log marginal likelihood $\log p_\theta(\mathbf{y}_\mathcal{T} \mid \mathbf{x}_\mathcal{T}, \mathcal{C})$, so there is no guarantee that we are still maximizing a lower bound on the likelihood of the parameters.
[1] Garnelo et al., 2018. Neural Processes.
[2] Garnelo et al., 2018. Conditional Neural Processes.